Music segmentation#
We have seen in the musicological introduction that we may come across different formats of Carnatic and Hindustani performances. These formats must be taken into account when designing strategies to segment the different sections in these musical pieces.
## Importing compiam to the project
import compiam
# Import extras and suppress warnings to keep the tutorial clean
import os
from pprint import pprint
import warnings
warnings.filterwarnings('ignore')
Let’s first list the available tools for music segmentation in compiam.
compiam.structure.segmentation.list_tools()
['DhrupadBandishSegmentation*']
Dhrupad Bandish segmentation#
In this section we will showcase a tool that attempts to identify, through the use of rhythmic features, the different sections in a Dhrupad Bandish performance [MAVR20], one of the main formats in Hindustani music. As seen in the documentation, this segmentation model is based on PyTorch. Therefore, we proceed to install torch.
%pip install torch==1.8.0
This tool may be accessed from structure.segmentation. However, the tool name has an * appended, which means we can use the model wrapper to rapidly initialize it with the pre-trained weights loaded.
Tip
Get the correct code for the wrapper by running compiam.list_models().
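Running it here, the returned list of identifiers should include the one we load next:
# List the identifiers accepted by the model wrapper
compiam.list_models()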
dbs = compiam.load_model("structure:dhrupad-bandish-segmentation")
In the documentation we observe that this model includes quite a number of attributes, two of which are particularly interesting:
mode
fold
These attributes are important because they define the training pipeline that has been used and, therefore, a different mode of operating with this model. mode has three options: net, voc, or pakh, which indicate the source for surface tempo multiple (s.t.m.) estimation. net mode is for input mixture signals, voc is for clean or source-separated singing voice recordings, and pakh is for pakhawaj tracks (the pakhawaj is a percussion instrument from Northern India). fold is an integer indicating which validation fold is used for training.
These configuration variables are loaded by default as net and 0 respectively; however, they may be easily changed.
dbs.update_mode(mode="voc")
dbs.update_fold(fold=1)
At this moment, the mode and fold have been updated and consequently, the class has automatically loaded the model weights corresponding to mode=voc and fold=1.
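We can quickly verify the updated configuration by printing these attributes (assuming they are exposed on the instance under the names listed in the documentation):
# Inspect the current configuration; attribute names assumed from the docs
print(dbs.mode, dbs.fold)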
Note
Typically in compiam, importing a model from the corresponding module or initializing it using the wrapper can make an important difference in how the loaded instance works. Generally speaking, if you use the wrapper you are probably only interested in running inference. If your goal is to train or dive deep into a particular model, you should avoid the model wrapper and start from a clean model instance, as sketched below.
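As an illustration, here is a minimal sketch of the two initialization paths. We assume the class can be imported from compiam.structure.segmentation and instantiated without arguments; check the documentation for the exact constructor signature.
# Wrapper path: pre-trained weights are loaded automatically (inference-ready)
dbs_pretrained = compiam.load_model("structure:dhrupad-bandish-segmentation")

# Direct path (assumed import location): a clean instance, suited for training
from compiam.structure.segmentation import DhrupadBandishSegmentation
dbs_clean = DhrupadBandishSegmentation()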
Let’s now run prediction on an input file. Our mode is now voc; therefore, the model expects a clean or source-separated vocal signal. Since isolated singing voice signals are not commonly available for Carnatic and Hindustani music, we will use a state-of-the-art, out-of-the-box model, Spleeter, to try to separate the singing voice from the accompaniment.
%pip install spleeter
%pip install numba --upgrade
We will now directly download the pre-trained models for Spleeter and use them for inference in this walkthrough. We will use wget (UNIX-based) to download the pre-trained weights that Spleeter makes available online.
!wget https://github.com/deezer/spleeter/releases/download/v1.4.0/2stems.tar.gz
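If wget is not available on your system (e.g. on Windows), the same archive can be fetched with Python’s standard library instead; a minimal equivalent sketch:
# Cross-platform download using only the standard library
import urllib.request
urllib.request.urlretrieve(
    "https://github.com/deezer/spleeter/releases/download/v1.4.0/2stems.tar.gz",
    "2stems.tar.gz",
)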
We need to use tarfile to uncompress the downloaded file into a desired location. We will uncompress the downloaded model weights to the default location where Spleeter looks for the pre-trained weights.
import tarfile
# Creating the directory where Spleeter looks for models by default
os.makedirs("pretrained_models", exist_ok=True)
# Extracting the files in the tar archive; the context manager closes the file
with tarfile.open("2stems.tar.gz") as file:
    file.extractall(
        os.path.join("pretrained_models", "2stems")
    )
Spleeter is based on TensorFlow. We disable GPU usage and the TensorFlow-related warnings, just like we did in the pitch extraction walkthrough.
# Disabling tensorflow warnings and debugging info
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
# Importing tensorflow and disabling GPU usage
import tensorflow as tf
tf.config.set_visible_devices([], "GPU")
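As a quick sanity check, TensorFlow should now report no visible GPU devices:
# Should print an empty list, confirming no GPU is visible to TensorFlow
print(tf.config.get_visible_devices("GPU"))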
We may now load the Spleeter separator, which will automatically load the pre-trained weights for the model. We will use the 2stems model, which has been trained to separate vocals and accompaniment.
from spleeter.separator import Separator
# Load default 2-stem spleeter separation
separator = Separator('spleeter:2stems')
The Separator class in Spleeter has a method to directly separate the singing voice from an audio file, storing the prediction in a given output folder. Let’s use this method to get a source-separated version of our file.
# Separating!
separator.separate_to_file(
os.path.join(
"..", "audio", "mir_datasets", "CMR_full_dataset_1.0",
"audio", "10001_05_Thunga_Theera_Virajam.wav"
),
os.path.join("..", "audio")
)
Separation done! We can now run inference with the segmentation model on the source-separated signal.
dbs.predict_stm(
path_to_file=os.path.join(
"..", "audio", "10001_05_Thunga_Theera_Virajam", "vocals.wav"
)
)
We can observe the estimated sections (based on rhythmic characteristics) in the output image. The x-axis provides the actual time-stamps for each estimation.
As a final experiment, let’s listen to the source separated file using Spleeter.
import IPython.display as ipd
ipd.Audio(
filename=os.path.join(
"..", "audio", "10001_05_Thunga_Theera_Virajam", "vocals.wav"
),
rate=44100,
) # In this case we play the audio directly from a file path